# NEARLY LOSSLESS CONTENT-DEPENDENT LOW-POWER DCT DESIGN FOR MOBILE VIDEO APPLICATIONS

Chia-Ping Lin, Po-Chih Tseng, and Liang-Gee Chen

DSP/IC Design Lab., Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University {cplin,pctseng,lgchen}@video.ee.ntu.edu.tw

## ABSTRACT

This paper proposes a practical content-dependent low-power DCT design with a tolerable quality drop. Low power has become increasingly important, especially for portable devices. Unfortunately, low-power design usually brings a significant quality drop at the same time. This work achieves ultra-low power dissipation while keeping the quality drop tolerable (about 0.1 dB) in most cases. The proposed architecture is based on the distributed arithmetic architecture [1] combined with a more reliable PPA classification algorithm [2]. It accurately controls bit-level calculation and avoids unnecessary calculation to save power. This characteristic is very useful in video encoder systems, since coefficients after DCT and quantization (Q) become zero with high probability. The power spent on these coefficients can be saved without causing an undesired quality drop.

## 1. INTRODUCTION

Since its first appearance in [3], the discrete cosine transform (DCT) has become the most widely used transform coding technique for various image and video coding algorithms. It has also been adopted by most image and video coding standards, including JPEG, MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264/AVC.

DCT is a dominant computation-intensive task in video encoder systems, second only to motion estimation (ME). For an encoder with full-search ME, DCT, quantization (Q), and their inverses together account for 16% of the overall complexity. For an encoder with fast-search ME, DCT/Q/IQ/IDCT accounts for a more significant 29% of the overall complexity. Since video encoder systems generally adopt fast-search ME for mobile video applications, the share of DCT/Q/IQ/IDCT is considerable. As a result, minimizing the power dissipation of the DCT is indispensable for achieving a low-power video encoder system for mobile video applications.

In addition to power dissipation, quality is also an essential factor for video encoder systems. Too much quality drop would make a DCT implementation impractical, since accumulated error propagation could degrade the encoded quality dramatically.

In the literature, many low-power DCT architectures have been proposed. The adopted techniques can be classified into two categories. One is the lossless approach, such as coefficient scaling, gated registers, and architectures based on distributed arithmetic (DA) [1]. The other is the lossy approach based on a content-dependent algorithm, such as PPA classification [2]. Although the lossy approach lowers power dissipation very effectively, the resulting quality drop is so significant that it becomes impractical for video encoder systems.

The low-power DCT design proposed in this paper reaches a tolerable quality drop by combining a content-dependent algorithm with all of the lossless techniques to satisfy the low-power requirement, which makes it practical and useful for building low-power video encoder systems for mobile video applications.

## 2. BACKGROUND





Fig. 1 shows the most commonly used row-column method for the 2D DCT, by which the 2D 8x8 DCT can be realized with 16 passes of the 1D 8-point DCT. Moreover, by applying first-level even/odd decomposition, the 1D 8-point DCT can be decomposed as follows.

$$\begin{bmatrix} Y(0) \\ Y(2) \\ Y(4) \\ Y(6) \end{bmatrix} = \begin{bmatrix} a & a & a & a \\ c & f & -f & -c \\ a & -a & -a & a \\ f & -c & c & -f \end{bmatrix} \begin{bmatrix} X(0) + X(7) \\ X(1) + X(6) \\ X(2) + X(5) \\ X(3) + X(4) \end{bmatrix}$$

$$\begin{bmatrix} Y(1) \\ Y(3) \\ Y(5) \\ Y(7) \end{bmatrix} = \begin{bmatrix} b & d & e & g \\ d & -g & -b & -e \\ e & -b & g & d \\ g & -e & d & -b \end{bmatrix} \begin{bmatrix} X(0) - X(7) \\ X(1) - X(6) \\ X(2) - X(5) \\ X(3) - X(4) \end{bmatrix}$$

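where, under the usual orthonormal scaling of the 8-point DCT, the coefficients are

$$\begin{bmatrix} a & b & c & d & e & f & g \end{bmatrix} = \frac{1}{2} \begin{bmatrix} \cos\frac{\pi}{4} & \cos\frac{\pi}{16} & \cos\frac{\pi}{8} & \cos\frac{3\pi}{16} & \cos\frac{5\pi}{16} & \cos\frac{3\pi}{8} & \cos\frac{7\pi}{16} \end{bmatrix}$$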
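The decomposition can be checked numerically. The following Python sketch assumes the standard orthonormal coefficients a = ½cos(π/4), b = ½cos(π/16), c = ½cos(π/8), d = ½cos(3π/16), e = ½cos(5π/16), f = ½cos(3π/8), and g = ½cos(7π/16), and compares the two 4x4 matrix products with the direct DCT-II definition. The function names are illustrative only.

```python
import math

# Assumed orthonormal 8-point DCT coefficients: 0.5 * cos(k * pi / 16).
a, b, c, d, e, f, g = (0.5 * math.cos(k * math.pi / 16) for k in (4, 1, 2, 3, 5, 6, 7))

EVEN = [[a, a, a, a], [c, f, -f, -c], [a, -a, -a, a], [f, -c, c, -f]]  # rows: Y(0), Y(2), Y(4), Y(6)
ODD = [[b, d, e, g], [d, -g, -b, -e], [e, -b, g, d], [g, -e, d, -b]]   # rows: Y(1), Y(3), Y(5), Y(7)


def dct8_even_odd(x):
    """1D 8-point DCT via first-level even/odd decomposition."""
    s = [x[i] + x[7 - i] for i in range(4)]  # butterfly sums feed the even outputs
    t = [x[i] - x[7 - i] for i in range(4)]  # butterfly differences feed the odd outputs
    y = [0.0] * 8
    for r in range(4):
        y[2 * r] = sum(EVEN[r][i] * s[i] for i in range(4))
        y[2 * r + 1] = sum(ODD[r][i] * t[i] for i in range(4))
    return y


def dct8_direct(x):
    """Reference: direct orthonormal DCT-II with eight multiplications per output."""
    out = []
    for k in range(8):
        ck = math.sqrt(0.5) if k == 0 else 1.0
        out.append(0.5 * ck * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)))
    return out
```

Each output now needs a 4-element inner product instead of an 8-element one, which is what allows each RAC to use a 4-input ROM.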

DA is a bit-serial operation that computes the inner product of two vectors (one of which is constant) in parallel without any multiplication. It uses a ROM and accumulator (RAC) structure to replace the constant-coefficient multiplier, as shown in Fig. 2. The advantage of DA is its low power dissipation, but it may encounter the mismatch condition. For the 1D 8-point DCT, if the bitwidth of the input data is larger than 8 bits, the mismatch condition of DA leads to precision loss under direct truncation.



Fig. 2. The ROM and accumulator (RAC) structure of DA
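A minimal software model of the RAC idea is sketched below. It is an illustration under simplifying assumptions, not the circuit of Fig. 2: the coefficients are plain integers standing in for fixed-point constants, the bit planes are processed LSB first, and the names `build_rom` and `da_inner_product` are hypothetical.

```python
def build_rom(coeffs):
    """ROM entry at address `addr` holds the sum of the coefficients whose address bit is set."""
    k = len(coeffs)
    return [sum(coeffs[i] for i in range(k) if (addr >> i) & 1) for addr in range(1 << k)]


def da_inner_product(coeffs, xs, width=8):
    """Bit-serial inner product of constant `coeffs` with `width`-bit two's-complement inputs."""
    rom = build_rom(coeffs)
    acc = 0
    for j in range(width):  # one bit plane per cycle
        addr = 0
        for i, x in enumerate(xs):
            # Python's >> on negative ints is arithmetic, so this reads two's-complement bits.
            addr |= ((x >> j) & 1) << i
        term = rom[addr] << j
        # The sign-bit plane carries negative weight in two's complement.
        acc += -term if j == width - 1 else term
    return acc
```

For a 4-input RAC the ROM has only 16 entries, and the number of cycles equals the number of bit planes processed. This is why reducing the bitwidth to be computed directly reduces the work of each RAC.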

## 3. LOSSLESS APPROACH

Several techniques can simplify the DCT computation without any quality drop.

### 3.1. Coefficient Scaling

The adopted flow graph with scaled coefficients and the DC/AC4 butterfly scales the DCT coefficient matrix with the constant a and further applies all-level even/odd decomposition specific to the DC and AC4 frequencies. After the DCT coefficient matrix is scaled with the constant a, the matrix-vector multiplications for the DC and AC4 frequencies involve only additions and subtractions. With the further application of all-level even/odd decomposition specific to the DC and AC4 frequencies, which results in the DC/AC4 butterfly shown in Fig. 3, the DC and AC4 frequencies can be implemented with only four adders/subtractors instead of two RACs. This technique has two advantages. First, the mismatch condition is alleviated. Since the DC frequency is implemented with bit-parallel adders, there is no mismatch condition for the DC frequency, which is originally the most critical frequency component. Second, the computational complexity is reduced. Since the RAC of the DC frequency usually has more bits to compute, implementing the DC and AC4 frequencies with only four adders/subtractors reduces the required arithmetic operations.



Fig. 3. Proposed 1D 8-point DCT core
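The DC/AC4 butterfly can be illustrated with a short sketch. It assumes that, after scaling with a, the DC and AC4 rows of the even matrix reduce to (1, 1, 1, 1) and (1, -1, -1, 1); the function names are hypothetical and the adder numbering is only for exposition.

```python
def first_level_sums(x):
    """First-level butterfly: s[i] = X(i) + X(7 - i)."""
    return [x[i] + x[7 - i] for i in range(4)]


def dc_ac4_butterfly(s):
    """Scaled DC and AC4 outputs from the four first-level sums, using four adders/subtractors."""
    p = s[0] + s[3]  # adder 1
    q = s[1] + s[2]  # adder 2
    dc = p + q       # adder 3: Y(0) / a
    ac4 = p - q      # subtractor 4: Y(4) / a
    return dc, ac4
```

Because these operations are bit-parallel additions on full-width data, no bit-serial truncation is involved for the DC term in this model.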

### 3.2. Selective Gated Registers

Rather than using shift registers to shift the input data and the data after the butterfly structure as in [2], the proposed design adopts selective gated registers (SGR) for registers D0 to D7 as well as for the even registers E0 to E3 and odd registers O1 to O3 after the butterfly structure. In every clock cycle only one of the eight registers D0 to D7 is selected, and the others are powered down by clock gating. Compared with shift registers, using selective gated registers for D0 to D7 can reduce the power dissipation to one eighth in the ideal case. In addition, the four even registers E0 to E3 and four odd registers O1 to O3 are activated for only one cycle in every eight, and all eight registers are powered down by clock gating for the other seven cycles. This significantly lowers the power dissipation compared with the shift registers originally used for the data after the butterfly structure.

The selective gated register array is also adopted to replace the transpose memory. Because of the parallel architecture, the transpose register array is active for only one cycle in every eight, which saves much unnecessary power dissipation.
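The one-eighth figure for D0 to D7 can be reproduced with a toy activity count. The sketch below is not a power model: it assumes that the number of registers receiving a clock edge is a rough proxy for register power, and both function names are hypothetical.

```python
def load_with_shift_registers(samples):
    """Shift-register loading: every register is clocked on every cycle."""
    regs = [0] * len(samples)
    clocked = 0
    for x in samples:
        regs = [x] + regs[:-1]  # all registers take a new value
        clocked += len(regs)
    return regs, clocked


def load_with_selective_gating(samples):
    """Selective gated loading: only the selected register is clocked in each cycle."""
    regs = [0] * len(samples)
    clocked = 0
    for i, x in enumerate(samples):
        regs[i] = x  # the other registers keep their values, with their clocks gated
        clocked += 1
    return regs, clocked
```

For eight samples the shift-register model counts 64 clocked register updates against 8 for the gated model, which corresponds to the ideal one-eighth ratio stated above. Both end up holding the same eight samples, only in a different order.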

## 4. CONTENT-DEPENDENT ALGORITHM

The content-dependent PPA (peak-to-peak pixel amplitude) algorithm for DCT was first proposed in [2]. This algorithm can predict, with good performance, which computations of the AC frequencies are unnecessary. Based on PPA, an advanced input classifier (AIC) is developed. Together with the dynamic effective bitwidth extraction (DEBE) technique, the proposed content-dependent algorithm further improves the performance of PPA and makes it more suitable for video encoder systems.

### 4.1. Advanced Input Classifier

The advanced input classifier can precisely predict zero or low-precision output data based on the input data, MB mode, and QP. For a given MB mode, the advanced input classifier reduces the bits to be computed according to the signal content variation of the input data (using the PPA criterion) as well as the chosen QP, and four threshold classes are defined to decide the maximum bitwidth to be computed by each RAC. There are therefore two sets of four threshold classes, one for intra MB mode and the other for inter MB mode. Rather than being a function of the input data only, the thresholding criterion of the advanced input classifier is a function of both the input data and QP. The threshold values are determined by exhaustive simulations, and the H.263 quantization method as defined in [4] is adopted. In addition, the threshold values of the second-stage 1D 8x1 DCT are twice those of the first stage. To reduce the control overhead, the same threshold values are used for the RACs of the odd AC frequencies (RAC1/3/5/7).
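The structure of such a classifier might look like the following sketch. All numeric values are hypothetical placeholders, since the actual thresholds come from exhaustive simulation and are not listed here. The linear scaling of the thresholds with QP is likewise an assumption made only to show a criterion that depends on both the input data and QP.

```python
# HYPOTHETICAL tables: the real thresholds are determined by exhaustive simulation.
BASE_THRESHOLDS = {"intra": (2, 8, 32), "inter": (4, 16, 64)}  # class boundaries per unit QP
CLASS_BITWIDTH = (0, 2, 5, 8)  # assumed maximum bits computed by a RAC for classes 0..3


def ppa(row):
    """Peak-to-peak pixel amplitude of one 8-sample input vector."""
    return max(row) - min(row)


def classify(row, qp, mb_mode, stage=1):
    """Return the maximum bitwidth to be computed for one RAC group."""
    scale = 2 if stage == 2 else 1  # second-stage thresholds are twice the first-stage ones
    thresholds = [t * qp * scale for t in BASE_THRESHOLDS[mb_mode]]  # assumed linear in QP
    cls = sum(ppa(row) >= t for t in thresholds)  # number of boundaries reached, 0..3
    return CLASS_BITWIDTH[cls]
```

In this sketch a flat input row falls into class 0 and needs no RAC cycles at all, while a larger QP raises every boundary so that more rows land in the low-bitwidth classes.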

### 4.2. Dynamic Effective Bitwidth Extraction

Rather than directly truncating the bitwidth under the mismatch condition as in [2], the dynamic effective bitwidth extraction technique, working together with the three techniques above, rejects the sign-extension bits of the data after the butterfly structure while carefully handling the mismatch condition. The data after the butterfly structure are stored in the even registers E0 to E3 and odd registers O1 to O3 without truncation of bitwidth. These data are then processed by the dynamic effective bitwidth extraction using the information from the advanced input classifier, as shown in Fig. 3. The dynamic effective bitwidth extraction first identifies the effective bitwidth by rejecting the sign-extension bits and reducing the bits according to the advanced input classifier. Since the most critical DC frequency is implemented with bit-parallel adders, the effective bitwidth for the other AC frequencies (AC1/2/3/5/6/7) is usually less than eight. The dynamic effective bitwidth extraction then dynamically extracts bits from the effective bitwidth range, one bit per cycle, to RAC1/3/5/7 and RAC2/6. Truncation of bitwidth can occur only when the effective bitwidth is larger than eight. This approach performs much better than the direct truncation adopted by [2] in terms of the accuracy of the output data, and thus the quality drop is effectively reduced. After all bits in the effective bitwidth range have been extracted, clock gating is applied to power down the dynamically idle circuits. Since the same threshold values are used for the RACs of the odd frequencies (RAC1/3/5/7), only a single pair of control signals is needed for these four RACs. With the four techniques above, the three drawbacks of [2], namely the significant output quality drop caused by the mismatch condition, an input classifier that can handle only the predefined QP case, and power-inefficient shift registers, are effectively overcome, and a more practical and efficient content-dependent DCT design is achieved.
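The effect of rejecting sign-extension bits can be sketched in software as follows. This is an illustrative model only. It assumes that the registers feeding one RAC share a common bit-serial schedule, that bit planes are issued MSB first, and that `max_bits` stands for the bit budget derived from the classifier; the function names are hypothetical.

```python
def effective_bitwidth(value):
    """Smallest two's-complement width that holds `value` once sign-extension bits are rejected."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1  # one extra bit for the sign


def extract_bit_planes(values, max_bits=8):
    """Yield (bit_position, bits) pairs, one bit plane per cycle, MSB first."""
    width = max(effective_bitwidth(v) for v in values)  # shared schedule across the registers
    lowest = max(0, width - max_bits)  # LSB planes are dropped only if the width exceeds the budget
    for j in range(width - 1, lowest - 1, -1):
        yield j, [(v >> j) & 1 for v in values]
```

For the values 5, -3, 0, and 1 the effective bitwidth is four, so four cycles are enough and no information is lost. Truncation appears in this model only when the effective bitwidth exceeds `max_bits`, which mirrors the behaviour described above.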

## 5. SIMULATION RESULT

Fig. 4 shows the simulation result of quality drop. The test sequences are stefan, weather, mobile, and foreman. The targeted video encoder system is MPEG-4 simple profile (SP) with predictive four-step search ME, GOP = 30, and IPPP format. When QP is larger than 8, the quality drops are all within 0.1 dB compared with the floating-point DCT.



Fig. 4. Quality drop of proposed DCT design

Fig. 5 shows the simulation result of computation cost. Because DC and AC4 are computed without RACs, this simulation measures only the number of bits that must be processed in the RACs of the other AC frequencies. As can be seen, the proposed algorithm reduces the computation by 50% on average. From Fig. 4 and Fig. 5, as QP becomes larger, the quality drop remains tolerable while the computation is reduced. This means that the algorithm effectively saves unneeded power without sacrificing quality.



## 6. IMPLEMENTATION RESULT

Table I: Gate-level implementation result.

| Area (Gate Count) | Power (mW @ 1.8 V, 33 MHz)          |
|-------------------|-------------------------------------|
| 28976             | 6.03 (worse case, Intra QP = 4)     |
|                   | 3.77 (normal case, Inter QP = 12)   |

The proposed content-dependent low-power DCT design has been implemented with a front-end cell-based design flow and synthesized with the Artisan standard cell library based on the UMC 0.18 µm 1P6M CMOS process. Table I shows the gate-level implementation result. The area is estimated in terms of synthesized gate count, and the power dissipation is estimated by Synopsys PrimePower gate-level power estimation in units of mW at 1.8 V and 33 MHz. Since the proposed DCT design adopts a content-dependent algorithm, different signal content variations of the input data result in different power dissipation. Therefore, two kinds of input data have been used for the power estimation. One is the worse case, in which the 8x8 blocks are of intra MB mode at QP = 4, while the other is the normal case, in which the 8x8 blocks are of inter MB mode at QP = 12. As can be seen from Table I, the content-dependent DCT design in the normal case consumes only about 63% of the power dissipated in the worse case. These power figures are estimated at 1.8 V and 33 MHz. For CIF at 30 fps, the proposed content-dependent low-power DCT design is only required to operate at 4.56 MHz. As a result, static voltage/frequency scaling can be applied, and the power dissipation after scaling to 1.2 V and 4.56 MHz is estimated to be 910 µW in the worse case or 473 µW in the normal case.
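The 4.56 MHz and 63% figures can be reproduced with simple arithmetic. The sketch below assumes 4:2:0 chroma subsampling (1.5 samples per pixel) and a throughput of one sample per clock cycle; neither assumption is stated explicitly above, but together they match the quoted operating frequency.

```python
WIDTH, HEIGHT, FPS = 352, 288, 30  # CIF at 30 frames per second
SAMPLES_PER_PIXEL = 1.5            # assumption: 4:2:0 chroma subsampling

samples_per_second = WIDTH * HEIGHT * SAMPLES_PER_PIXEL * FPS
required_mhz = samples_per_second / 1e6  # assumption: one sample processed per clock cycle
normal_to_worse = 3.77 / 6.03            # power ratio taken from Table I

print(f"{required_mhz:.2f} MHz, {normal_to_worse:.0%}")  # prints "4.56 MHz, 63%"
```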

Fig. 6 shows the power breakdown in the worse case and in the normal case. Fig. 7 gives a further power analysis of the 1D DCT core. It clearly shows that the power of the RACs can be dramatically reduced. In the normal case, only unavoidable power, such as that of the transpose register array and the input/output registers, dominates the DCT power dissipation.



Fig. 6. Power breakdown of proposed DCT design



Fig. 7. Power breakdown of proposed 1D DCT core.

Table II: Comparison with prior arts.

| Design           | Parameter          | Area  | Power (µW @ MSamples/sec)  | Quality Drop |
|------------------|--------------------|-------|----------------------------|--------------|
| Proposed         | 0.18 µm, 1.8 V, CB | 28976 | 181 (worse), 113 (normal)  | 0.1 dB       |
| Xanthopoulos [2] | 0.6 µm, 1.56 V, FC | 30K   | 313                        | ~4 dB        |
| Fanucci [5]      | 0.18 µm, 1.6 V, CB | 30K   | 629                        | No           |
| August [6]       | 0.18 µm, 1.8 V, CB | 15K   | 154                        | 1.32 dB      |
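The normalized power of the proposed design in Table II appears to follow from the Table I figures. The sketch below assumes one sample per clock cycle and reads the "33 MHz" clock as 33.33 MHz (100/3 MHz). Both are assumptions: with exactly 33 MHz the results would round to 183 and 114 instead. The function name is hypothetical.

```python
CLOCK_MHZ = 100 / 3  # assumption: "33 MHz" means 33.33 MHz, with one sample per cycle


def uw_per_msample(power_mw, msamples_per_s=CLOCK_MHZ):
    """Convert power in mW at a given sample rate to µW per MSample/s."""
    return power_mw * 1000 / msamples_per_s


worse = round(uw_per_msample(6.03))   # 181
normal = round(uw_per_msample(3.77))  # 113
```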

## 7. CONCLUSION

Table II lists the comparison with prior arts. It shows that this work has a better balance between power and quality. With only a 0.1 dB quality drop, this work achieves the best power efficiency compared with prior arts. This makes it a more practical and efficient content-dependent DCT design, which is very suitable for power-constrained devices in mobile video applications.

## 8. REFERENCES

[1] A. Peled and B. Liu, "A new hardware realization of digital filter," *IEEE Transactions on Acoustics, Speech, and Signal Processing*, vol. 22, no. 2, pp. 456–462, Dec. 1974.

[2] T. Xanthopoulos and A. P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 5, pp. 740–750, May 2000.

[3] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," *IEEE Transactions on Computers*, vol. 23, no. 1, pp. 90–93, Jan. 1974.

[4] Information Technology – Coding of Audio-Visual Objects – Part 2: Visual, ISO/IEC 14496-2, 1999.

[5] L. Fanucci and S. Saponara, "Data-driven VLSI computation for low-power DCT-based video coding," in *Proc. of International Conference on Electronics, Circuits, and Systems*, 2002, pp. 541–544.

[6] N. J. August and D. S. Ha, "Low power design of DCT and IDCT for low bit rate video codecs," *IEEE Transactions on Multimedia*, vol. 6, no. 3, pp. 414–422, June 2004.